Loan Default Classification

Machine learning in the FinTech industry
Problem statement: The company faces challenges in its loan procedure, specifically with defaulters. It wants to transform the procedure using a machine learning model to improve decision-making and reduce defaults.
Solution: The company can use AI and automation to analyze customer data in real time and predict the likelihood of a customer defaulting on a loan. Machine learning models such as logistic regression, decision trees, random forests, and neural networks can support loan-approval decisions. The data analytics team should ensure there is enough data to train the model accurately, choose the right features for predicting default, and select appropriate performance metrics for evaluation. The future of AI and automation in the loan procedure looks promising: better decision-making, lower default risk, and improved customer service. This can benefit the economy by improving the financial health of individuals and businesses, leading to increased investment and economic growth.

How can our company transform with the use of AI and automation to improve the loan procedure and prevent defaulters?
Overview of how we could implement this: With the help of AI and automation, we can analyze customer data in real-time and predict the likelihood of a customer defaulting on their loan. We can use machine learning models such as logistic regression, decision trees, random forests, and neural networks to make better decisions regarding loan approval.
What parameters should we consider when building these machine learning models?
The data analytics team should consider several parameters when building these models. First, we need to ensure that we have enough relevant data to train the model accurately. Second, we need to choose the right features to predict the likelihood of default. Finally, we need to consider the performance metrics, such as precision, recall, and accuracy, used to evaluate the model.
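As a hedged illustration, the metrics mentioned above can be computed with scikit-learn; the labels below are toy values standing in for real model predictions:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]  # 1 = defaulted (toy labels)
y_pred = [0, 0, 0, 1, 1, 0, 1, 0, 1, 0]  # hypothetical model output

print("accuracy: ", accuracy_score(y_true, y_pred))   # share of all predictions that are correct
print("precision:", precision_score(y_true, y_pred))  # of predicted defaults, how many were real
print("recall:   ", recall_score(y_true, y_pred))     # of real defaults, how many were caught
```

For this toy example the model catches 3 of the 4 true defaults, so recall is 0.75.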
What do you think the future of AI and automation in the loan procedure looks like?
I believe the future looks very promising for AI and automation in the loan procedure. As we continue to collect more data and improve our machine learning models, we'll be able to make better decisions regarding loan approval and reduce the risk of defaulters. Additionally, we can provide better customer service through automation and personalized loan servicing, which could lead to higher customer satisfaction and loyalty.
How do you think this will affect the recession in the market?
I think this will have a positive impact on the economy. By reducing the risk of defaulters, the financial health of individuals and businesses will improve, which could lead to increased investment and economic growth. It's a win-win situation for both our company and the economy.

MLOps
The MLOps pipeline involves data collection and preprocessing, feature engineering, model training, model deployment, and monitoring. It can be automated with tools such as Docker, Azure ML, and Kubernetes, and deployed on cloud infrastructure such as AWS, GCP, or Azure.
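As a rough sketch (not the company's actual pipeline), the training stage can bundle preprocessing and the model into a single scikit-learn Pipeline artifact, which tools like Docker and Azure ML would then package for deployment; the features and labels below are placeholders:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Preprocessing and model bundled so deployment serves a single artifact.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                                # stand-in features
y = np.array([1 if i % 10 == 0 else 0 for i in range(100)])  # stand-in labels, 10% positive

pipeline.fit(X, y)
print(pipeline.predict(X[:5]))
```

Serializing this one object (e.g. with joblib) keeps the scaler and model versions in sync between training and serving.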

About Dataset
A simulated financial dataset has been generated using genuine information from a financial organization. Identifying characteristics have been removed and the figures altered to prevent any linkage to the original source (the financial institution). The purpose of this dataset is to give trainees a simple financial dataset for practicing financial analytics in a proof of concept (POC).

Highlights of the Loan Default Classification:
- Classification, Imbalanced Data, and PR Curve
Contents

  1. EDA
    Check NaN values
    Data overview
    Feature engineering
    Data distribution

  2. Modeling
    Train test split
    Standardization
    Upsampling by SMOTE
    Logistic regression
    Support vector machine
    Random forest
    LightGBM
    XGBoost
    Model assessment
    ROC curve
    PR curve

  3. Conclusion
EDA

Check NaN values

Data overview

The column labeled "Employed" is of categorical type, while the "Bank Balance" and "Annual Salary" columns are numerical. Our objective is to perform a binary classification task based on the target column "Defaulted."

Feature engineering

We generate a new feature named "Saving Rate" based on the "Bank Balance" and "Annual Salary" data. The Saving Rate feature provides insight into the spending habits of each user. Generally, a user with a higher Saving Rate is considered less likely to default. We will investigate the relationship between these variables in greater detail later on.
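A minimal sketch of this step, assuming the Saving Rate is defined as the ratio of "Bank Balance" to "Annual Salary" (the exact formula is an assumption here), with toy rows:

```python
import pandas as pd

# Toy rows with the column names described in the data overview.
df = pd.DataFrame({
    "Bank Balance": [0.0, 5000.0, 20000.0],
    "Annual Salary": [30000.0, 50000.0, 40000.0],
})

# Saving Rate: fraction of annual salary held as bank balance.
df["Saving Rate"] = df["Bank Balance"] / df["Annual Salary"]
print(df["Saving Rate"].tolist())  # [0.0, 0.1, 0.5]
```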

Data distribution

Default distribution

Loan defaults affect only 3% of customers, resulting in an imbalanced classification problem.

Employed distribution

Contingency table

Pearson’s χ2 test for independence

Conclusion: Since the p-value (between 0.0005 and 0.05) falls below the 0.05 significance level, we conclude that the two variables are not independent. Employment status can therefore be used to predict default.
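The test above can be reproduced with SciPy's chi2_contingency; the contingency table below is illustrative, not the dataset's actual counts:

```python
from scipy.stats import chi2_contingency

# Rows: employed / unemployed; columns: not defaulted / defaulted.
# Illustrative counts only, not taken from the dataset.
table = [[6500, 150],
         [3200, 150]]

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}, dof = {dof}")
if p_value < 0.05:
    print("Reject independence: employment status is informative about default.")
```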

Bank Balance distribution

We find that this is an asymmetric distribution, with many people having zero bank balance.

Let's check this further by counting the number of accounts with less than 10 dollars.
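A minimal sketch of this check, assuming a DataFrame `df` with the "Bank Balance" column (toy values shown):

```python
import pandas as pd

df = pd.DataFrame({"Bank Balance": [0.0, 3.5, 9.99, 10.0, 1200.0, 0.0]})

# Count accounts holding less than 10 dollars.
near_zero = (df["Bank Balance"] < 10).sum()
print(f"Accounts with less than $10: {near_zero}")  # 4
```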

Conclusion:

Approximately 500 individuals have hardly saved any money in their bank accounts, which could pose a risk for loan defaults. Surprisingly, those who have defaulted on their loans tend to have a higher balance in their bank accounts. This observation may seem counterintuitive and suggests the presence of confounding factors. It is possible that individuals with a higher bank balance may have easier access to loans, leading to a higher number of defaults.

Annual Salary distribution

Conclusion:

  1. In comparison to bank balance, there are fewer outliers when it comes to annual salary.
  2. Default cases appear to be distributed across all annual salary ranges, suggesting that annual salary may not be a reliable predictor of loan defaults.

Saving Rate distribution

Conclusion:

The distribution of saving rate is similar to that of bank balance, but with a few extreme outliers. This suggests that people's saving habits can vary significantly. Some individuals may earn a high income but spend more than they save, while others with relatively low salaries may have a significant amount of savings.

Modeling

Train test split
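The split itself is not shown in this export; a typical stratified split that preserves the rare default rate in both sets might look like this (toy data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)                           # stand-in features
y = np.array([1 if i % 33 == 0 else 0 for i in range(100)])  # rare positive class

# stratify=y keeps the positive-class proportion similar in train and test,
# which matters when defaults are only ~3% of the data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
print(y_train.mean(), y_test.mean())
```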

Standardization
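A sketch of this step: the scaler's statistics come from the training set only and are then applied to the test set, avoiding data leakage (toy values):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 150.0]])

scaler = StandardScaler().fit(X_train)  # mean/std estimated on training data only
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)   # test set reuses the training statistics

print(X_train_std.mean(axis=0))  # ~0 per feature after standardization
```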

Upsampling by SMOTE

During the Exploratory Data Analysis (EDA) phase, it was observed that defaulted cases constituted only 3% of the samples. This highly imbalanced dataset could pose a challenge for classification models that aim to minimize the cost function. To address this issue, the SMOTE upsampling method was introduced to rebalance the dataset.

Classification

The models we will examine are Logistic Regression, Support Vector Machine, Random Forest, LightGBM, and XGBoost. Our primary metric for optimization is the recall rate for predicting defaulted cases: in a bank loan default problem, falsely rejecting a loan only forfeits potential interest, while a missed default leads to a significant loss of the entire principal.
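To make the asymmetry concrete, here is a toy cost comparison; all figures are illustrative assumptions, not values from the dataset:

```python
# Hypothetical loan terms for illustration only.
principal = 10_000
interest_rate = 0.05

cost_false_positive = principal * interest_rate  # good loan rejected: lost interest
cost_false_negative = principal                  # default missed: lost principal

print(cost_false_positive)                        # 500.0
print(cost_false_negative)                        # 10000
print(cost_false_negative / cost_false_positive)  # 20.0
```

Under these assumed terms a missed default is 20 times more costly than a wrongly rejected loan, which is why recall on the default class is prioritized over precision.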

Logistic regression

Cross validation

First prediction result

Hyperparameter tuning

```python
distributions = dict(C=np.linspace(2, 1000, 100), penalty=['l2', 'l1'])

clf = RandomizedSearchCV(LogisticRegression(solver='saga', random_state=RAND_SEED),
                         distributions, scoring='recall', n_iter=100, n_jobs=-1,
                         random_state=RAND_SEED)
clf_logistic = clf.fit(X_train, y_train)
clf_logistic.best_params_
```

{'penalty': 'l2', 'C': 254.02020202020202}

Tuned prediction result

Support vector machine

Cross validation

First prediction result

Hyperparameter tuning

```python
distributions = dict(C=np.logspace(0, 4, 50),
                     degree=np.arange(1, 11),  # polynomial degrees 1-10
                     class_weight=[None, 'balanced'])
```

{'degree': 1.0, 'class_weight': None, 'C': 494.1713361323833}

Tuned prediction result

Random forest

Cross validation

First prediction result

Hyperparameter tuning

```python
distributions = dict(n_estimators=np.arange(10, 500, 10),
                     criterion=['gini', 'entropy'],
                     max_depth=range(1, 20),  # depth must be >= 1
                     min_samples_split=range(2, 20),
                     min_samples_leaf=range(3, 50),
                     bootstrap=[True, False],
                     class_weight=['balanced', 'balanced_subsample'])

clf = RandomizedSearchCV(RandomForestClassifier(), distributions,
                         scoring='recall', n_iter=20, n_jobs=4,
                         random_state=RAND_SEED)
clf_random_forest = clf.fit(X_train, y_train)
clf_random_forest.best_params_
```

{'n_estimators': 490, 'min_samples_split': 14, 'min_samples_leaf': 5, 'max_depth': 8, 'criterion': 'gini', 'class_weight': 'balanced_subsample', 'bootstrap': False}

Tuned prediction result

LightGBM

Cross validation

First prediction result

Hyperparameter tuning

```python
distributions = {
    'learning_rate': np.logspace(-5, 2, 50),
    'num_leaves': np.arange(10, 100, 10),
    'max_depth': np.arange(3, 13, 1),
    'colsample_bytree': np.linspace(0.1, 1, 10),
    'min_split_gain': np.linspace(0.01, 0.1, 10),
}

clf = RandomizedSearchCV(lgb.LGBMClassifier(), distributions,
                         scoring='recall', n_iter=100, n_jobs=4,
                         random_state=RAND_SEED)
clf_lgb = clf.fit(X_train, y_train)
clf_lgb.best_params_
```

{'num_leaves': 60, 'min_split_gain': 0.030000000000000006, 'max_depth': 8, 'learning_rate': 0.07196856730011514, 'colsample_bytree': 0.7000000000000001}

Tuned prediction result

XGBoost

Cross validation

First prediction result

Hyperparameter tuning

```python
distributions = {
    'n_estimators': np.arange(100, 1000, 100),
    'max_depth': np.arange(2, 10, 1),
    'learning_rate': np.logspace(-4, 1, 50),
    'subsample': np.linspace(0.1, 1, 10),
    'colsample_bytree': np.linspace(0.1, 1, 10),
}

clf = RandomizedSearchCV(XGBClassifier(), distributions,
                         scoring='recall', n_iter=10, n_jobs=4,
                         random_state=RAND_SEED)
clf_xgb = clf.fit(X_train, y_train)
clf_xgb.best_params_
```

{'subsample': 0.9, 'n_estimators': 600, 'max_depth': 8, 'learning_rate': 0.008685113737513529, 'colsample_bytree': 0.6}

Tuned prediction result

Model assessment

ROC curve

Given the imbalanced nature of our dataset, our emphasis is on the precision-recall curve. Based on the test-set results, the Logistic Regression model performed well.
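A sketch of how a precision-recall curve and its AUC are computed with scikit-learn, using toy scores in place of the models' predicted probabilities:

```python
from sklearn.metrics import auc, precision_recall_curve

y_true = [0, 0, 0, 0, 1, 0, 1, 0, 1, 1]                       # toy test labels
y_score = [0.1, 0.2, 0.15, 0.3, 0.8, 0.6, 0.4, 0.25, 0.9, 0.7]  # toy model scores

# Precision/recall at every score threshold, then the area under that curve.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
pr_auc = auc(recall, precision)
print(f"PR-AUC = {pr_auc:.3f}")
```

Unlike ROC-AUC, the PR-AUC baseline equals the positive-class rate (here ~3% for defaults), so it reflects performance on the rare class much more sharply.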

Conclusion

The purpose of this notebook is to work with an imbalanced loan default dataset using multiple ML models. Our findings show that the Random Forest model achieved the highest recall of 89% on the test set, while the Logistic Regression model surpassed all other models with the top precision-recall AUC of 0.5238. With additional features and further feature engineering, there is potential to enhance these results in the future.